Prevent allocation of currently allocated resources #1046

Merged: 2 commits merged into flux-framework:master from milroy:alloc-fix on Jul 11, 2023

Conversation

@milroy (Member) commented Jul 8, 2023

Problem: issue #1043 identified a scenario where Fluxion will grant a new allocation to a job while the resources are still occupied by the previous allocation. The double booking occurs because Fluxion assumes a job will not run beyond its walltime. However, as the issue describes, an epilog script may cause a job to run beyond its walltime. Since Fluxion doesn't receive a free message until the epilog completes, the allocation remains in the resource graph, but the scheduled point at the allocation's completion time is exceeded, allowing the resources to be allocated to another job. There are other common scenarios that can lead to multiple concurrent allocations, such as a job getting stuck in CLEANUP.

This PR adds a check for an existing allocation on each exclusive resource vertex during traversal pruning. The check occurs only for AT_ALLOC traversals, which prevents another job from receiving the resources while still allowing reservation and satisfiability traversals to succeed. AT_ALLOC traversals are blocked from matching the resources until the jobid is removed from the per-vertex allocations map by a free message from the job-manager.
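For intuition, here is a minimal, self-contained sketch of that rule (a toy model, not the PR's code: alloc_type_t and vertex_t below are simplified stand-ins for Fluxion's jobmeta and vertex types, and the real check operates on (*m_graph)[u].schedule.allocations during pruning):

#include <cstdint>
#include <iostream>
#include <map>

// Traversal types, modeled after the names used in this thread.
enum class alloc_type_t { AT_ALLOC, AT_ALLOC_ORELSE_RESERVE, AT_SATISFIABILITY };

struct vertex_t {
    // jobid -> allocated count; an exclusive vertex holds at most one entry
    std::map<int64_t, int64_t> allocations;
};

// Return true if the traversal must be pruned at this vertex. The entry
// is only erased from allocations when job-manager sends a free message
// for the owning jobid, so a second AT_ALLOC traversal cannot match
// these resources until then.
bool prune_existing_alloc (alloc_type_t type, const vertex_t &v)
{
    return type == alloc_type_t::AT_ALLOC && !v.allocations.empty ();
}

int main ()
{
    vertex_t busy;
    busy.allocations[42] = 1; // job 42 still occupies this vertex
    std::cout << prune_existing_alloc (alloc_type_t::AT_ALLOC, busy)                // 1: pruned
              << prune_existing_alloc (alloc_type_t::AT_ALLOC_ORELSE_RESERVE, busy) // 0: reservation allowed
              << prune_existing_alloc (alloc_type_t::AT_SATISFIABILITY, busy)       // 0: satisfiability allowed
              << "\n";
    return 0;
}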

The check has an (I think) unavoidable side effect: reservations on the allocated subgraph will be allowed immediately after the at time of the AT_ALLOC traversal, because the planner_multi_avail_time_first call will regard extra resources as available due to incorrect aggregate counts. The side effect is minor, though, as the reservation will be blocked from transitioning to allocated until a free message cancels the existing allocation. During subsequent scheduling loops, if the AT_ALLOC traversal fails, the reservation will be pushed back during the AT_ALLOC_ORELSE_RESERVE traversal. That will continue until matching resources become available.
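To make the push-back behavior concrete, here is a toy simulation (the times, the 60-second step, and the loop structure are all illustrative, not Fluxion's actual scheduling loop) of a reservation being deferred each loop until a free message clears the previous allocation:

#include <cstdint>
#include <iostream>

int main ()
{
    bool vertex_occupied = true; // previous job has not sent free yet
    int64_t reserved_at = 3600;  // optimistic reservation time from aggregate counts

    for (int loop = 0; loop < 4; loop++) {
        if (loop == 2)
            vertex_occupied = false; // free message cancels the old allocation
        if (vertex_occupied) {
            // AT_ALLOC fails at the occupied vertex, so the
            // AT_ALLOC_ORELSE_RESERVE traversal pushes the reservation back.
            reserved_at += 60;
            std::cout << "loop " << loop << ": reservation pushed back to t=" << reserved_at << "\n";
        } else {
            std::cout << "loop " << loop << ": allocated at t=" << reserved_at << "\n";
            break;
        }
    }
    return 0;
}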

The PR fixes the reproducer from flux-framework/flux-core#5304 and passes the test in PR #1044, but it's WIP because the behavior for non-exclusive resource requests is still suspect.

I chose to place the check in the by_excl part of prune because there can be only a single allocation in (*m_graph)[u].schedule.allocations for an exclusive vertex. Checking at non-exclusively allocated vertices is possible, but the logic is much more complex: it would have to determine which jobid is delayed by comparing the availability counts returned by planner with the jobs in the (*m_graph)[u].schedule.allocations map, and then determine the true number of available resources.

Checking only at exclusive vertices can involve a deeper traversal than necessary (compared with stopping at higher, non-exclusive vertices), but I think it will correctly block allocations provided that leaf-level vertices must be allocated exclusively. That appears to be ensured here:

if (exclusive_in && resource.exclusive == Jobspec::tristate_t::FALSE) {
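As a toy illustration of that guard (excl_conflict is a hypothetical helper, not Fluxion's API; it mirrors the quoted condition), a resource requested with exclusive: false conflicts as soon as the walk has entered its subtree exclusively, which is what forces everything under an exclusive slot to be exclusive:

#include <iostream>

// Simplified stand-in for Jobspec::tristate_t.
enum class tristate_t { FALSE, TRUE, UNSPECIFIED };

// Hypothetical helper mirroring the quoted guard: once the walk is
// exclusive (exclusive_in), a child requested with exclusive: false
// conflicts and the match fails.
bool excl_conflict (bool exclusive_in, tristate_t requested)
{
    return exclusive_in && requested == tristate_t::FALSE;
}

int main ()
{
    std::cout << excl_conflict (true, tristate_t::FALSE)  // 1: conflict under an exclusive slot
              << excl_conflict (false, tristate_t::FALSE) // 0: fine outside an exclusive subtree
              << "\n";
    return 0;
}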

There are two items to verify, plus one testsuite update, before the WIP can be dropped:

  • Make sure leaf-level vertices must be allocated exclusively
  • Check the behavior for upwalks into an auxiliary subsystem
  • Change test_expect_failure to test_expect_success in t/t1024-alloc-check.t after PR #1044 (testsuite: add test for double-booking) is merged

This PR should be merged after PR #1044.

@milroy milroy changed the title from "[WIP] traverser: prevent allocation of currently allocated resources" to "[WIP] Prevent allocation of currently allocated resources" on Jul 8, 2023

@garlick (Member) commented Jul 8, 2023

Thanks @milroy for taking the time to write up the explanation.

#1044 has been merged, so at some point this PR can be rebased on master and a commit added that changes the two test_expect_failure tests in t1024-alloc-check.t to test_expect_success.

@grondo (Contributor) commented Jul 10, 2023

Just checking in here since the deadline for new packages for the next TOSS release is tomorrow. I think the CI failures are just because the expect-failure alloc-check tests need to be updated to expect success:

# 2 known breakage(s) vanished; please update test(s)

Happy to push an update here if that would help @milroy.

@milroy (Member, Author) commented Jul 10, 2023

@grondo: just got out of meetings; I rebased and pushed the update to the testsuite. I'm going to check the two remaining items in the description and drop the WIP tag if possible.

@milroy (Member, Author) commented Jul 11, 2023

I've confirmed that the exclusivity check here:

if (exclusive_in && resource.exclusive == Jobspec::tristate_t::FALSE) {

ensures that everything under slot must be allocated exclusively. For example, the following simple jobspec:

version: 9999
resources:
  - type: cluster
    count: 1
    with:
      - type: rack
        count: 1
        with:
          - type: node
            count: 1
            with:
              - type: socket
                count: 1
                with:
                  - type: slot
                    label: corelevel
                    count: 1
                    with:
                      - type: core
                        exclusive: false
                        count: 1

# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: corelevel
    count:
      per_slot: 1

Produces the following match error (overly verbose; we'll need to fix that):

resource-query> m allocate t/data/resource/jobspecs/exclusive/test001.yaml
ERROR: by_excl: exclusivity conflicts at jobspec= : vertex=core0by_excl: exclusivity conflicts at jobspec= : vertex=core1
[...]

From that standpoint this PR should work as intended. However, is Fluxion conforming to RFC 14 with the jobspec above? In particular, here: https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_14.html#reserved-resource-types, which states:

A task slot SHALL have at least one edge specified using with:, and the resources associated with a slot SHALL be exclusively allocated to the program described in the jobspec, unless otherwise specified in the exclusive field of the associated resource.

The last sentence implies that exclusive: false at a leaf vertex should be supported by Fluxion. @grondo and @garlick, is my assessment correct?

@garlick (Member) commented Jul 11, 2023

Honestly I do not know. We don't have command line tooling that can express that AFAIK. Maybe we could open an issue on that one for later?

@milroy (Member, Author) commented Jul 11, 2023

Maybe we could open an issue on that one for later?

My concern is that if it's decided that Fluxion needs to support requests like the one above, then the fix in this PR won't work.

@garlick (Member) commented Jul 11, 2023

I could be mistaken but I think jobspec v1 would forbid that type of slot definition. See https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_25.html#v1-specific-resource-graph-restrictions. flux-core ensures that only v1 compliant jobspec can be submitted to the scheduler, so...is that sufficient justification for deferring this concern?

@milroy (Member, Author) commented Jul 11, 2023

It looks like the intention was to make an exclusive: false request under slot invalid in RFC 25 (flux-framework/rfc#189 and 3552b2c#r309027939), but I don't see that it ever happened.

@milroy (Member, Author) commented Jul 11, 2023

so...is that sufficient justification for deferring this concern?

Yes, it is. I was a bit confused by the Resource Graph Restrictions section, but I think that, together with:

a node type resource vertex MAY also contain the following optional keys:
exclusive

(implying that no other vertex types can contain that key), it is strong enough.

Problem: issue flux-framework#1043 identified a scenario where
Fluxion will grant a new allocation to a job while the
resources are still occupied by the previous allocation.
The double booking occurs due to the assumption
Fluxion makes that a job will not run beyond its
walltime. However, as the issue describes, an epilog
script may cause a job to run beyond its walltime. Since
Fluxion doesn't receive a `free` message until the epilog
completes, the allocation remains in the resource graph
but the scheduled point at allocation completion is
exceeded, allowing the resources to be allocated
to another job. There are other common scenarios that can
lead to multiple concurrent allocations, such as a job
getting stuck in CLEANUP.

Add a check for an existing allocation on each exclusive
resource vertex for allocation traversals during graph
traversal pruning. This prevents another job from receiving
the resources and allows reservations and satisfiability
checks to complete.

Problem: flux-framework#1044 introduced tests to check for
allocation failures due to jobs that exceed their time limit.
PR flux-framework#1044 had to define the two `alloc-check`
tests as `test_expect_failure` because the mechanism to
prevent multiple booking was not in place.

Update the two `alloc-check` tests to `test_expect_success`
since the PR containing this commit fixes multiple booking
in Fluxion.

codecov bot commented Jul 11, 2023

Codecov Report

Merging #1046 (c961577) into master (506c0c4) will increase coverage by 0.0%.
The diff coverage is 100.0%.

❗ Current head c961577 differs from pull request most recent head d0716cc. Consider uploading reports for the commit d0716cc to get more accurate results

@@          Coverage Diff           @@
##           master   #1046   +/-   ##
======================================
  Coverage    74.3%   74.4%           
======================================
  Files          86      86           
  Lines        9432    9434    +2     
======================================
+ Hits         7017    7022    +5     
+ Misses       2415    2412    -3     
Impacted Files                      Coverage Δ
resource/traversers/dfu_impl.cpp    85.0% <100.0%> (+<0.1%) ⬆️

... and 1 file with indirect coverage changes

@milroy (Member, Author) commented Jul 11, 2023

After inspecting the auxiliary subsystem traversals and testing the fix in PR #1041, I'm satisfied that this fix will likely work with (and shouldn't break) the aux traversals.

Removing WIP and requesting an additional review.

@milroy milroy changed the title from "[WIP] Prevent allocation of currently allocated resources" to "Prevent allocation of currently allocated resources" on Jul 11, 2023
@milroy milroy requested review from garlick and trws July 11, 2023 07:17

@garlick (Member) left a comment

LGTM!

Just gave this a quick sanity check on my test cluster with the artificially extended prolog/epilog and it seems to work well. (Not too different from the sharness test so maybe that doesn't add much).

@grondo (Contributor) left a comment

LGTM too! Perhaps @jameshcorbett should add his review in case there is something here that might affect rabbit allocations.

@trws (Member) commented Jul 11, 2023

Yeah, this looks good. We may end up deciding to tweak it as part of the traversal refactor I want to do, but it should be easier to express after that too. Really good idea @milroy!

@jameshcorbett (Member) left a comment

Looks reasonable to me!

@milroy (Member, Author) commented Jul 11, 2023

Thanks @milroy for taking the time to write up the explanation.

No problem @garlick, and let me know if additional details would help. I didn't include a discussion of why, in planner-like logic, handling early termination is much easier than handling jobs that run longer than their walltime, because the change here sidesteps that difficulty.

Let me know if I should configure this PR to close issue #1043.

@milroy (Member, Author) commented Jul 11, 2023

Thanks all for the discussion and reviews! Setting MWP.

@milroy milroy added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) on Jul 11, 2023
@mergify mergify bot merged commit bfc044e into flux-framework:master Jul 11, 2023
@milroy milroy deleted the alloc-fix branch July 11, 2023 16:50
@milroy milroy mentioned this pull request Jul 13, 2023